Ranking Hotel Listings on Expedia
May 2022 ~ MSc Course "Data Mining Techniques"
Length: 1 month (at 0.25 FTE)
Programming language: Python (Pandas, time, datetime, scikit-learn, LightGBM)
Data: Roughly five million hotel search queries, containing the
properties of the retrieved hotels, details about the user, and whether the visitor clicked on
each hotel and whether they booked it
Problem description:
Rank hotel listings based on their likelihood of being booked
Approach:
First, the data was explored: its shape, followed by the types, missing values, and
distributions of its features. Second, new variables were derived, missing entries were
filled, and the categorical features were one-hot encoded. Next, the data was split into
training (90%) and validation (10%) sets. Finally, two LightGBM models were built: a regressor
(pointwise approach) and a ranker (listwise approach).
Results:
After testing different target variables and encoding techniques for the categorical features,
multiple hyperparameter combinations were compared using 4-fold cross-validation. On the
validation data, the regressor and the ranker reached an NDCG@5 of 0.350 and 0.375, respectively.
Next, the models with the best hyperparameters were retrained on all the data (including the
held-out set) and scored 0.333 and 0.366, respectively, on the competition test set, suggesting
a slight overfit to the validation data. The resulting NDCG@5 scores were interpreted relative
to the random baseline ranking (0.156). Both models score more than twice as high as the
baseline (roughly 2.1 and 2.3 times, respectively). In other words, as measured by NDCG@5,
Expedia would serve its customers about twice as well by using either model instead of
displaying listings in random order.
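For reference, NDCG@5 and a random baseline can be computed as sketched below. The exponential
gain and the graded relevance (5 = booked, 1 = clicked only, 0 = neither) are assumptions, since
the exact formula used by the competition is not restated here, and the toy searches are made up
for illustration.

```python
import math
import random


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain of the first k items, in displayed order."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)
        for pos, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=5):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(searches, k=5):
    """Average NDCG@k over searches; each search is a list of relevances."""
    return sum(ndcg_at_k(s, k) for s in searches) / len(searches)


def random_baseline(searches, k=5, n_repeats=100, seed=0):
    """Estimate the NDCG@k of a random ordering by repeated shuffling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_repeats):
        shuffled = [rng.sample(s, len(s)) for s in searches]
        total += mean_ndcg(shuffled, k)
    return total / n_repeats


# Toy searches (made up): relevance of each listing in the order a model ranked it.
model_ranking = [
    [5, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 5, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]
model_score = mean_ndcg(model_ranking)
baseline_score = random_baseline(model_ranking)
print(f"model NDCG@5:  {model_score:.3f}")
print(f"random NDCG@5: {baseline_score:.3f}")
print(f"ratio:         {model_score / baseline_score:.2f}")
```

The ratio printed at the end is the same kind of comparison as the one made above between the
model scores and the random baseline.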
Finally, the picture below shows the feature importance of the final LGBMRanker
model. The most important variable is by far the property id, suggesting that specific properties
are considerably more likely to be booked than others when retrieved by a search.